Abstract - In this work, a multiple-expert binarization framework
for multispectral images is proposed. The framework
is based on a constrained subspace selection limited to the
spectral bands combined with state-of-the-art gray-level binarization
methods. The framework uses a binarization wrapper to
enhance the performance of the gray-level binarization. Nonlinear
preprocessing of the individual spectral bands is used to enhance
the textual information. An evolutionary optimizer is considered
to obtain the optimal and some suboptimal 3-band subspaces from
which an ensemble of experts is then formed. The framework is
applied to a ground truth multispectral dataset with promising
results. In addition, a generalization of the cross-validation
approach is developed that not only evaluates the generalizability of
the framework but also provides a practical instance of the selected
experts that can then be applied to unseen inputs despite the
small size of the given ground truth dataset.

Keywords - Multispectral, Ancient Manuscripts, Binarization.

I. Introduction 

Digitization and computer-based archiving of ancient
manuscripts have been of great interest, both to bring the documented
human heritage into public access and to produce
intangible replications of this heritage in order to preserve
it beyond the limits of its physical carriers [1], [2], [S1]-[S3].¹
Multispectral (MS) imaging has been used toward these
goals because of its high capability to record data beyond
what is ‘visible’ to the human eye [3]-[6]. An MS image could
be imagined as a generalized color image with more than three
bands. However, in practice this analogy may not hold accurately:
there is a big difference between a color image generated
by broadband (FWHM² > 60 nm) filters calibrated to
reproduce the same visual sensation for the human eye and
an MS image generated by a series of intermediate-band
(FWHM ~ 11-60 nm) or narrowband (FWHM ~ 4-10 nm)
filters. In addition, various challenges are associated with
MS imaging, such as i) nonlinear misregistration among bands,
ii) high IR noise, and iii) a larger amount of data.

Segmentation and binarization of MS images could stand
as a convergent point between MS image processing and
well-studied color/gray document image processing. A great
obstacle in this direction is the labor cost associated with
creating reference data, especially considering the high volume
of data contained in MS images. This has resulted in
indirect evaluations of the performance of enhancement and
segmentation methods for MS images using OCR or other
goal-oriented approaches [6]. With the availability of ground-truth
datasets of MS images of ancient manuscripts [3], developing
direct binarization methods for MS images has been pursued

1 Because of the limited space, the images and some citations are provided
in the Supplementary Material and in the Postscript, which is accessible at
http://arxiv.org/pdf/1502.01199.pdf#page=6.

2 FWHM: Full width at half maximum [7].
Fig. 1: The proposed multiple-expert binarization framework.



Fig. 2: A binarization wrapper for any given binarization kernel. 


in a more systematic way. In this work, a multiple-expert
framework to binarize MS images is proposed. The framework
uses a subspace selection of MS bands along with a given
state-of-the-art gray-level binarization method, which we call
the kernel. In order to limit the scope of the framework, it is
assumed that the kernels are smart enough to adjust their internal
parameters for each individual input gray-level image that they
receive. In the future, the framework will be extended to include
optimal selection of the internal parameters of the kernels along
with the band selection. The framework is multiple expert in
the sense that it considers various instances of subspaces in
the form of an ensemble of experts, and then simply combines
their results. More complex ways of combining the opinions
of the multiple experts, such as the Unsupervised Ensemble of
Experts Reduction (UEoER) approach [8] and its possible
quality-aware generalizations, are not considered in this work.





































































Each expert is associated with a set of selected bands
and also with the way these bands are converted to a single-band
(gray-level) image. In this work, we assume that every subspace
has a dimension of 3, i.e., three bands are selected by every
expert, and it is also assumed that the selected 3-band images
are converted into a gray-level image using the traditional gray
conversion of the RGB color space [9].³

The paper is organized as follows. In Section II, the
notations and also two kernel binarization methods used in the
experiments are provided. The dataset is described in Section III.
The main framework and its components are described in Section IV.
The generalized cross-validation approach is presented in Section V,
followed by the experimental results in Section VI. Finally, the
conclusions and some prospects for the future are listed in Section VII.

II. Basic Definitions 

In this section, some of the basic concepts are first defined: 

1) Multispectral Image: A multispectral (MS) image in this
work is composed of 8 bands recorded using a multispectral
camera (DTA s.r.l. Chroma C3). The bands are
produced using a series of filters at 340 nm (fluorescence),
450 nm (blue), 550 nm (green), 650 nm (red), 800 nm,
900 nm, 1,000 nm, and 1,100 nm. The visible filters
are broadband at FWHM = 80 nm, while the infrared
(IR) filters are intermediate-band at FWHM = 50 nm.
The camera sensor is a two-phase full-frame low-dark-current
CCD (KAF-6303E). Each band of an MS image
is recorded in the BW01 protocol, i.e., black has a
value of 0, and white corresponds to 1 [11]. An MS image
is denoted by an $N_{\mathrm{band}}$-tuple $u$,
$u = \left(u_i(x)\right)_{i=1}^{N_{\mathrm{band}}}$, where
$i = 1, \dots, N_{\mathrm{band}}$ is the band counter,
$N_{\mathrm{band}}$ (= 8) is the number of spectral bands,
and $x$ is a pixel on the image domain
$\Omega \subset \mathbb{R}^2$, $u : \Omega \to [0,1]^{N_{\mathrm{band}}}$.

2) Gray-level Image: In this work, a gray-level image is
represented by $I$ in general, and it is assumed that it
follows the BW10 protocol [11] (black is 1 and white
is 0): $I : \Omega \to [0,1]$.

A. Binarization Methods 

Two state-of-the-art binarization methods are considered in 
this work as the kernel binarization methods: 

1) Laplacian Energy [12] (LE in short): The Laplacian-energy 
method, inspired by a Markov random field model, defines the 
binarization as a minimization problem for a global energy 
function. The fidelity term is defined based on the intensity 
Laplacian that is highly contrast- and intensity-independent. 
Moreover, the edge information is used to ensure that the 
binarization boundaries are aligned with edges. Optimization of 
four internal parameters is considered in the Laplacian-energy 
method: The two hysteresis thresholds of the Canny edge map, 
the radius of the Gaussian filter, and the mismatch penalty. 

2) Phase Congruency [13] (PC in short): This method uses 
a combination of phase feature maps, such as the maximum 
moment of phase congruency covariance (MMPCC) and the 
local weighted mean phase angle (LWMPA) and the regional 
minima feature maps, and also adaptive Gaussian and median 

3 In the future, other color-to-gray methods, such as the Dual Transform
[10], will also be considered. Some implementations can be found here:
http://www.mathworks.com/matlabcentral/fileexchange/27578-universal-color-to-gray-conversion.


filtering in order to provide a robust and consistent binarization
performance for various types of degradation. The method has
three explicit internal parameters: the number of scales, the
number of orientations of the wavelet transform, and the
threshold of noise standard deviation. In this work, optimization
of these parameters for every input image is not considered,
and a fixed set of optimal values for these parameters obtained
using the DIBCO series datasets [14] is used. The adaptability of the
internal processes is the main capability of this method to optimize
its performance across various degradation types.

III. The Dataset 

One of the ground truth datasets of MS images provided
in [3] is used in this work.⁴ The dataset is composed of 21 MS
images of ancient manuscripts, which follow the description
of MS images provided in Section II. Every MS image is
accompanied by a binary image that provides the segmentation
of the text on the associated manuscript image.

IV. The Proposed Multiple-Expert Framework 

The schematic diagram of the framework is shown in 
Figure 1. The framework receives a set of given ground truth 
MS images and a gray-level binarization method (the kernel). 
The spectral bands of the MS images are first enhanced 
using the Gray-Expand transform [15] in order to increase 
the differentiability of the textual information. The kernel 
binarization is then wrapped (see Section IV-A), and using an 
evolutionary optimizer [16], the individual-best 3Bs of every 
MS image along with their tailing suboptimal 3Bs are obtained 
(see Section IV-B). Then, the rare-or-frequent 3Bs are chosen
to create an ensemble of 3B experts (see Section IV-C). Having
such an ensemble, any new unseen MS image is passed through 
the preprocessing and then through the binarization wrapper 
for every member of the ensemble, and the final binarization 
output is obtained by combining all the binary outputs in 
a simple averaging step. The details of various steps of the 
framework are presented in the following subsections. 
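
The final combination step described above can be illustrated with a minimal sketch. It assumes that every expert returns a binary map of the same size with text pixels set to 1, and that the averaged map is turned back into a binary image by thresholding at 0.5 (a majority vote). The function name `combine_experts` and the 0.5 threshold are assumptions made for illustration; the text only specifies a simple averaging step.

```python
import numpy as np

def combine_experts(binary_outputs, threshold=0.5):
    """Average the binary maps of all experts and re-binarize the result.

    binary_outputs: iterable of 2-D arrays with values in {0, 1} (1 = text).
    threshold: assumed cut-off on the averaged map (0.5 = majority vote).
    """
    stack = np.stack([np.asarray(b, dtype=float) for b in binary_outputs], axis=0)
    mean_map = stack.mean(axis=0)  # fraction of experts voting "text" per pixel
    return (mean_map > threshold).astype(np.uint8)

# Three hypothetical expert outputs on a 2x2 image.
a = [[1, 0], [1, 1]]
b = [[1, 0], [0, 1]]
c = [[0, 0], [1, 1]]
combined = combine_experts([a, b, c])
```

With an odd number of experts, the averaged value never equals 0.5 exactly, so the vote has no ties.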

A. The Proposed Binarization Wrapper Method 

The wrapper adds three features to every kernel that it
hosts: i) it passes the individual bands of the input color/multiband
image through a blurring/deblurring process in order to minimize
the registration/mismatching error among bands, ii) it converts
the resulting multiband image into a gray-level image using a
color-to-gray transform, and
iii) after obtaining the output of the kernel on the gray-level
image, it performs a test to ensure the kernel is not trapped on
some small regions of the input image.

For the blurring and deblurring steps, a Gaussian profile with
$\sigma = 0.5$ and a radius of 5 pixels and another Gaussian profile
with $\sigma = 5$ and a radius of 5 pixels are respectively used. The
luminance color-to-gray transform is considered in this work,
and the singularity test is performed using a ratio threshold
followed by an inpainting step if necessary.

B. Optimization and Individual-Best 3Bs 

In order to obtain the best 3-band (3B) selection associated
with each MS image in the given dataset, an
evolutionary optimizer, called Curved Space Optimizer (CSO)
[16], is considered. In the process to obtain the global best

4 Available online: http://www.synchromedia.ca/databases/msi-histodoc 




3B, the optimizer visits various 3B values. In this work, in 
addition to the global optimal (the best) 3B associated with an 
MS image (and implicitly with a specific kernel method), the 
top tailing 3Bs of that image are also reported. An example of 
the output of the optimizer for the image ‘z30’ of the dataset 
is provided in Table I (considering the LE method as kernel). 


Optimality\Bands           Band R   Band G   Band B
Global (Individual-Best)   8        2        1
Sub-Optimal 1              6        2        6
Sub-Optimal 2              7        2        6
Sub-Optimal 3              5        3        2

TABLE I: The global optimal and also sub-optimal (tailing) 3Bs of
the image ‘z30’ in the given dataset.


C. Selection of the Experts 

Having all individual-best and tailing 3Bs of every MS
image of a given (sub-)dataset, the following procedure is
considered to extract the rare and also the frequent 3Bs in
order to build the set of multiple experts required by the
framework. First, for every image, the ranked list of its 3Bs is
weighted with descending integer numbers starting from 0
(i.e., 0, -1, -2, and so on).
Then, all the weighted 3Bs of all images are aggregated using
the summation function. The rare instances are collected by
choosing those 3Bs that have a sum of zero. In addition, a
number of the most frequent 3Bs are selected by choosing those
that have the lowest negative sum (i.e., the highest absolute
sum). In this work, we consider up to 5 of the most
frequent 3Bs. The odd-sized union of the rare and frequent 3B
sets is considered to get the final set of selected experts.
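
The selection procedure above can be sketched as follows. This is only an illustration under stated assumptions: the rank-r entry of each image's list is given the weight -r, and an even-sized union is made odd by dropping its last entry (the least frequent of the frequent 3Bs when any exist). The text does not say how the odd size is enforced, so that rule and the name `select_experts` are hypothetical.

```python
from collections import defaultdict

def select_experts(ranked_3bs_per_image, max_frequent=5):
    """Pick rare and frequent 3-band selections (3Bs) as experts.

    ranked_3bs_per_image: one list per image, holding 3-band tuples ordered
    from the individual-best (rank 0) to the tailing sub-optimal ones.
    """
    totals = defaultdict(int)
    for ranked in ranked_3bs_per_image:
        for rank, bands in enumerate(ranked):
            totals[tuple(bands)] += -rank  # assumed weights: 0, -1, -2, ...

    # Rare 3Bs: aggregated sum of zero.
    rare = [b for b, s in totals.items() if s == 0]
    # Frequent 3Bs: lowest negative sums first, capped at max_frequent.
    negative = sorted((b for b, s in totals.items() if s < 0), key=lambda b: totals[b])
    frequent = negative[:max_frequent]

    experts = rare + frequent
    # Assumed rule for enforcing an odd-sized set: drop the last entry.
    if len(experts) % 2 == 0 and len(experts) > 1:
        experts.pop()
    return experts

# Two hypothetical images with three ranked 3Bs each.
ranked = [
    [(8, 2, 1), (6, 2, 6), (7, 2, 6)],
    [(4, 3, 2), (6, 2, 6), (7, 2, 6)],
]
experts = select_experts(ranked)
```

In this toy input, (8, 2, 1) and (4, 3, 2) are rare (sum 0), while (7, 2, 6) and (6, 2, 6) have sums of -4 and -2; the union has four members, so the last one is dropped to keep the set odd-sized.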

V. Cross-Validation Search (CVS) Extension 

In the experimental section, Section VI, we will use the
cross-validation approach to evaluate how generalizable the proposed
framework is, considering the small size of the
ground truth data. However, the next challenge is
how to select a proper subset of data that can be used to
process unseen, not-yet-available data (probably the data of
an upcoming contest), especially when the ground truth data
is small, as is the case in this work.⁵ Here, we first discuss this
challenge, and then we propose another approach to perform
such a selection in a fair way, i.e., maximizing the performance
while avoiding possible overfitting. The proposed extension
is called the Cross-Validation Search (CVS).

Let us review the notation of a p-holdout cross validation
(0 < p < 1). Assuming that the ‘given’ ground truth data is
of a total size of N, the ‘training’ data used in every iteration
of the cross-validation process would be a randomly selected
subset of the given data with a size of (1 - p)N. The rest of
the data, i.e., a subset with a size of pN, would play the role
of the ‘validation’ data in that particular iteration. If a number
of $N_{\mathrm{CV}}$ iterations is performed, the mean and the standard
deviation values of a measure, such as the mean F-measure on
the validation data, could be used for the purpose of validating
5 When the size of available data is small, cross-validation approaches are 
common in order to validate a methodology [17] or a hypothesis [18]. 


whether the method under study is generalizable. We will 
follow this procedure to validate the proposed framework. 
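
A minimal sketch of the p-holdout splitting described above is given below. The training size is rounded up to an integer, which is an assumption; it happens to reproduce the training-subset sizes listed in Table IV for 21 images. The function name `holdout_splits` and the fixed seed are illustrative only.

```python
import math
import random

def holdout_splits(n_items, p, n_iter, seed=0):
    """Yield (training, validation) index lists for a p-holdout cross validation.

    n_items: size N of the given ground truth data.
    p: fraction of the data held out for validation (0 < p < 1).
    n_iter: number of random iterations.
    """
    rng = random.Random(seed)
    n_train = math.ceil((1.0 - p) * n_items)  # assumed rounding rule
    indices = list(range(n_items))
    for _ in range(n_iter):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield sorted(shuffled[:n_train]), sorted(shuffled[n_train:])

splits = list(holdout_splits(n_items=21, p=0.2, n_iter=50))
```

The mean and the standard deviation of the validation F-measure over the resulting iterations would then serve as the validation statistics.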

The next question is how to choose a subset of the
‘given’ data to be used in processing unseen, upcoming new
data. Various strategies could be used: i) Minimal standard
deviation on the validation subset: Although this well-known
approach is fair and implicitly searches for the most generalizable
set by selecting the easiest validation subset, it could
fail when the size of the given data is small; ii)
Maximal performance on the validation data: There is again a
high chance of low performance, especially because usually a
p < 0.5 is selected; iii) Maximal performance on the whole
given data: Here, there is a high risk of overfitting; and iv)
Using the whole given data as the training data (p = 0): There
is a chance of both overfitting and low performance
even on the given data, especially in the case of multi-expert
methods. The last approach is denoted All 3Bs in this work.

Here, we propose to use an extension, called Cross-Validation
Search (CVS), in the form of a cross-validation
measure limited to the validation subset in order to avoid
overfitting while searching for maximum performance. To define
such a measure, we assume that, for each member of the given
data, the F-measure scores of three experts are available: i) a
shared ‘typical’ expert, ii) an upper-bound expert, and iii) the
multiple-expert method under study. It is worth mentioning
that there is probably a different upper-bound (individual-best)
expert for each member of the given set. In this work,
which is limited to 3-band subspace experts, the shared typical
expert is assumed to be the trivial RGB-band expert, and the
individual-best experts are also simple 3-band experts (without
any combination of 3Bs). The average performance of each of
the three experts is then calculated on the validation subset:

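$$\mathrm{FM}_k = \left(\mathrm{FM}_{\mathrm{typ},k},\ \mathrm{FM}_{\mathrm{bes},k},\ \mathrm{FM}_{\mathrm{mul},k}\right), \qquad (1)$$

where $\mathrm{FM}_{\mathrm{typ},k}$, $\mathrm{FM}_{\mathrm{bes},k}$, and $\mathrm{FM}_{\mathrm{mul},k}$ are the average
F-measure performance of the typical expert, the individual-best
expert(s), and the multiple-expert under study on the validation
subset of a particular iteration $k$, respectively. The proposed
CVS measure of the iteration $k$ is then defined as follows:

$$\mathrm{CVS}_k := \left(\mathrm{FM}_{\mathrm{mul},k} - \mathrm{FM}_{\mathrm{typ},k}\right) - \left(\mathrm{FM}_{\mathrm{bes},k} - \mathrm{FM}_{\mathrm{mul},k}\right), \qquad (2)$$

$$\phantom{\mathrm{CVS}_k :}= 2\,\mathrm{FM}_{\mathrm{mul},k} - \mathrm{FM}_{\mathrm{typ},k} - \mathrm{FM}_{\mathrm{bes},k}. \qquad (3)$$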

The first term in Equation (2) represents how much the method
is better than the typical expert, while the second term measures
how far it falls short of the individual-best upper bound.
Therefore, the whole $\mathrm{CVS}_k$ calculates the
goodness of a training subset on its associated validation subset
relative to the upper and lower bounds given by the typical
and individual-best experts. Instead of a ratio, the difference
is used in order to avoid sensitivity to small improvements. It
could be argued that a training subset $k$ which provides a high
$\mathrm{CVS}_k$ value would also have a good performance on itself.
In the proposed CVS approach, the particular training subset
associated with the iteration with the highest $\mathrm{CVS}_k$ is selected to
build the final multiple-expert method for upcoming inputs:

$$k^* = \operatorname*{arg\,max}_{k = 1, \dots, N_{\mathrm{CV}}} \mathrm{CVS}_k. \qquad (4)$$



















Case\Measure      FMavg   FMstd   FMavg,1   FMstd,1
Individual Best   80.81   5.37    81.49     4.48
RGB Bands         69.58   19.91   72.30     15.90
All 3Bs           73.23   14.66   75.86     8.54

TABLE II: The performance of the individual-best 3B binarization
wrapper with the LE method as the kernel, and also that of all the 3Bs
of all 21 MS images combined using the method of Section IV-C.


The 3Bs associated with the members of the training subset of
iteration $k^*$ are used to build a multiple-expert method with
an estimated CVS performance of $\mathrm{CVS}_{k^*}$.
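
Equations (2) to (4) can be written out as a short sketch. The function names are illustrative, and the iteration index returned here is 0-based. As a rough check, the validation-subset values of the $k^*$ iteration listed in Table III (typical 62.38, individual-best 78.44, multiple-expert 72.23) give 3.64, close to the reported 3.65; the small gap is presumably due to rounding of the tabulated values.

```python
def cvs_measure(fm_typ, fm_bes, fm_mul):
    """CVS of one iteration, Equation (2): gain over the typical expert
    minus the shortfall from the individual-best expert."""
    return (fm_mul - fm_typ) - (fm_bes - fm_mul)

def select_best_iteration(scores):
    """Return (k_star, CVS value) for the iteration with the highest CVS.

    scores: list of (fm_typ, fm_bes, fm_mul) tuples, one per iteration,
    each holding average F-measures on that iteration's validation subset.
    """
    values = [cvs_measure(*s) for s in scores]
    k_star = max(range(len(values)), key=values.__getitem__)
    return k_star, values[k_star]

# Validation-subset averages of the k* iteration from Table III.
cvs_table3 = cvs_measure(fm_typ=62.38, fm_bes=78.44, fm_mul=72.23)

# Three hypothetical iterations; the second has the highest CVS.
k_star, best = select_best_iteration([(60, 80, 70), (60, 80, 75), (60, 80, 65)])
```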

It is worth mentioning that we carry out the iteration
process twice. The first time, it is used to validate the method
under study in the same way as in a standard
cross-validation process, and in each iteration a pure random
selection is used. The second time, it calculates the optimal
iteration $k^*$, and instead of using a pure random selection, we
use the same heuristic optimizer algorithm of [16] to control
the selection process of training and validation subsets. For
simplicity of notation, the same number of
iterations is used for this rerun. We argue that the high number
of possible selections and also the low number of iterations
performed would lead a pure random selection process to
settle with a sub-optimal or even non-optimal result. Using an
optimizer could be imagined as setting $N_{\mathrm{CV}} \to \infty$. However,
it should be again mentioned that the average statistics of
the cross validation, provided in the first row of Table III in
the next section, are calculated using a completely random
selection with $N_{\mathrm{CV}} = 50$, and no optimization was performed.

VI. Experimental Results and Discussions 

In Table II, the performance of the binarization wrapper
introduced in Section IV-A is presented using the LE binarization
method as the kernel. First, using the evolutionary
optimization algorithm of [16], the best 3B is determined for
every multispectral image in the dataset. The performance of
these individual-best 3Bs on the whole dataset is provided
in the first row of the table. The $\mathrm{FM}_{\mathrm{avg}}$, $\mathrm{FM}_{\mathrm{std}}$, $\mathrm{FM}_{\mathrm{avg},1}$,
and $\mathrm{FM}_{\mathrm{std},1}$ are the average of the F-measure, the standard
deviation of the F-measure, the average of the F-measure
excluding the worst image, and the standard deviation of the
F-measure excluding the worst image, respectively. For the
purpose of comparison and also to have a ‘typical’ way of
selecting the 3 bands, the case of RGB bands is provided in the
second row. As can be seen, there is a big difference (around
10%, i.e., 80.81 versus 69.58 in $\mathrm{FM}_{\mathrm{avg}}$) between their
performance. In the third row, labeled All
3Bs, the performance of the proposed multiple 3Bs framework
is provided in a case where the best 3B of all given images
along with their tailing best 3Bs are combined as described in
Section IV-C. As discussed in Section V, the performance of
the All 3Bs case could not be guaranteed to be generalizable.

The results obtained using the proposed CVS approach are
provided in Tables III and IV in comparison with those of the
minimum standard deviation approach. In Table III, for the
case of p = 0.2, the $k^*$ iteration achieved a performance
comparable to that of the All 3Bs case. Interestingly, in Table
IV, an improved performance of $\Delta\mathrm{FM}_{\mathrm{avg}} = 1.33\%$ compared
to the minimum standard deviation approach was achieved


Case\Measure        FMavg   FMstd   FMavg,1   FMstd,1   CVS
Average CV†         77.19   6.50    80.14     2.73      -0.05
k*†                 72.23   17.07   80.40     6.02      3.65
RGB Bands†‡         62.38   32.25   -         -         -
Individual Best†‡   78.44   8.47    -         -         -
k*                  73.93   14.05   76.46     8.91      3.65
All 3Bs             73.23   14.66   75.86     8.54      -
RGB Bands           69.58   19.91   72.30     15.90     -
Individual Best     80.81   5.37    81.49     4.48      -

TABLE III: The performance of the CVS approach (p = 0.2). Notes:
†The performance presented is associated with the ‘validation’ subset.
‡This performance is associated with the k* iteration.




                 Proposed CVS               Minimum Std
p\Measure   #†   FMavg   FMstd   CVSk*      FMavg   FMstd
p = 0.10    19   73.27   15.05   10.02      73.48   14.65
p = 0.20    17   73.93   14.50   3.65       73.57   14.53
p = 0.50    11   74.82   11.61   3.59       73.49   14.35
p = 0.90    3    75.05   13.40   2.41       74.49   13.24
p = 0.97    1    76.32   11.05   0.98       74.01   11.64

TABLE IV: The performance of the proposed CVS approach against
that of the minimal standard deviation across the parameter p (the
LE method is the kernel). Note: †This column denotes the number of
sample images used in the training subset.


with a smaller training subset size (p = 0.5). It is worth noting
that the case of p = 0.97 has only 1 image in the training
subset (resulting in a multiple-expert method of five 3Bs).




Case              p      #†   FMavg   FMstd   CVSk*
CVS               0.10   19   76.72   7.85    3.43
CVS               0.20   17   76.70   7.82    1.35
CVS               0.50   11   76.54   7.58    0.36
CVS               0.60   9    76.38   8.34    -0.24
CVS               0.90   3    76.99   7.22    0.60
CVS               0.97   1    77.29   7.20    -0.56
Individual Best   -      21   79.69   6.36    -
RGB Bands         -      1    74.80   9.86    -

TABLE V: The performance with the PC method as the kernel.


The same procedure was carried out using the PC method
as the kernel; the results are reported in Table V. There is less
variation across the p values, which can be attributed to the higher
adaptability of the kernel’s internal processes. The multiple-3Bs
binarization methods developed using the proposed framework
with p = 0.97 (considering the LE method or PC method as
the kernel method) will be used as baseline methods in the
ICDAR 2015 MultiSpectral Text Extraction Contest [19].⁶

On another dataset of 9 multispectral images from [3], an
$\mathrm{FM}_{\mathrm{avg},1}$ score of 78.76 was obtained with the CVS (p = 0.97,

6 http://www.synchromedia.ca/system/files/MSTEx_ICDAR15_CFP.pdf, 
http://www.synchromedia.ca/competition/ICDAR/mstexicdar2015.html. 








































































































Fig. 3: A subjective evaluation. (a) z67, 2nd dataset [3]. (b) Result of
[20]. (c) Result of the CVS (p = 0.97, PC).


LE) compared to an $\mathrm{FM}_{\mathrm{avg},1}$ of 73.40 obtained using RGB
bands and the LE kernel, and to an $\mathrm{FM}_{\mathrm{avg},1}$ score of 79.53
obtained by the best reported method [20] (see Figure 3). We
also obtained an $\mathrm{FM}_{\mathrm{avg},1}$ of 79.01 with the CVS combined with
a generalization of the LE method inspired by [20].

Finally, it is worth mentioning that although the multiple-expert
method would explicitly use all the best sets of parameters
of every member of the ‘training’ data, it would not actually
perform any optimization or ‘tuning’ toward maximizing
the performance on the whole training data; a multiple-expert
way of augmenting the individual-best parameter sets could
result in downgraded performance.

VII. Conclusions and Future Prospects

A binarization framework for multispectral images based
on multiple-expert 3-band selection has been proposed. The
framework is composed of a binarization wrapper, an optimizer,
and an expert selection process. It receives a dataset of ground
truth images and a gray-level binarization method (the kernel), and then
generates an ensemble of experts in the form of three-band
subspace selections that can be used along with the wrapper to
binarize any new input image. In addition, a generalized
cross-validation approach is introduced to minimize the side effects
of the small size of ground truth datasets. The framework and
the cross-validation approach have been applied to a ground
truth multispectral-image dataset along with two state-of-the-art
gray-level binarization methods with promising results.

In the future, i) the impact of other color-to-gray conversions, ii)
more than 3 (and also a variable number of) bands, iii) extension
of the experts to cover internal parameters of the binarization
kernels, and iv) generalized measures (of members instead of
averages) for the CVS approach will be considered. A study of
the impact of the evolutionary optimization in cross-validation
partitioning toward maximizing the CVS measure on other
datasets, along with the integration of quality-aware
ensemble-of-expert reduction approaches to reduce the size of the
selected-experts set, are other directions to be investigated.
Supplementary Material

A. Full Description of the Multispectral Datasets 

Although the dataset used for training and testing the 
proposed framework has been described in Section III, a more 
detailed description of all datasets involved in this work is 
provided here. 

1) The 20MS Dataset [3]: This dataset is the main dataset
used in this work, and it is provided in [3]. It contains 20
multispectral (MS) images. Therefore, we call this dataset
20MS in short. The MS images of the 20MS dataset are
publicly available at the following web page:
http://www.synchromedia.ca/databases/msi-histodoc, under the file name
S-MS_1.zip.⁷ The results presented in Tables III, IV, and V
have been generated using this dataset.

In the text, the configuration used along the proposed 
framework is usually denoted in the form of a tuple. For 
example, when we talk about the CVS (p = 0.97, LE) 
configuration, it means that we use the CVS selection, a p value 
of 0.97, and the LE kernel along the proposed framework. It 
is worth mentioning that there is another element in the tuple 
that has been ignored in the text. This element denotes the 
dataset used for training. Because we only used the 20MS 
dataset as the training set, the associated element has been 
dropped from the tuple. In full form, the example configuration 
mentioned above would be denoted the CVS (p = 0.97, LE, 
20MS dataset) configuration. 

2) The 9MS Dataset [3]: Another MS image dataset has
been introduced in [3] that contains 9 MS images. We call
this dataset 9MS, and in the text it has been referred to
in the comparison with the method introduced in [20]. The
dataset is available at: http://www.synchromedia.ca/databases/HISTODOC1,
under the file name HISTODOC1.zip.⁸

3) The 10MS Dataset [19]: Along with the ICDAR 2015
MultiSpectral Text Extraction Contest [19],⁴ a separate MS
image dataset containing 10 MS images has been developed.
Because of the timeline of the contest, the dataset was not
publicly available at the time of writing this paper.

4) The 3MS Dataset: In [S4], another dataset of 3 MS
images was introduced for the purpose of invisible text
detection, which we call the 3MS dataset. The MS images of
the 3MS dataset could be retrieved by contacting the authors
of [S4].

B. Subjective Results and Comparison 

Because of the limited space in the main paper, here we
provide some examples of the performance of the proposed
framework on the images from various datasets. Also, a subjective
comparison with the results of previously published work
([20] and [S4]) on binarization and invisible text detection in
multispectral document images is presented.

Figure S-1 provides a visual comparison of various methods
on the 9MS dataset of [3] (see Section A.2 of the Supplementary
Material for more information). To be consistent with
the results reported in [20], two of the multispectral images,
namely z67 and z95, were chosen in this figure. The second
7 http://www.synchromedia.ca/system/files/S-MS_1.zip.

8 http://www.synchromedia.ca/system/files/HISTODOC1.zip.


band of the input images, the output reported in [20], the
output of the proposed framework in the CVS (p = 0.97, LE)
configuration, the output of the CVS (p = 0.97, LE +
[20]) configuration, the output of the CVS (p = 0.97, PC)
configuration, and the ground truth images are shown in the
figure rows, respectively. In particular, it is worth mentioning
that some methods receive low scores, especially the CVS
(p = 0.97, PC) configuration, for the input z67 (Figure S-1(a))
because of a black frame around this image that does
not actually exist on the manuscript.

In Figure S-2, another subjective comparison is provided in 
which the performance of the proposed framework in various 
CVS combinations is compared with a state-of-the-art invisible 
text detection method [S4]. A multispectral image from [S4] 
is considered, and the outputs of the CVS (p = 0.97, LE) 
configuration, the ground truth, and the result reported in [S4] 
are visually compared. It is worth mentioning that the goal of
the text detection method is not binarization at the pixel level;
it is geared more toward better visualization for human experts.
This is in contrast to the 20MS dataset used in training of the 
proposed framework’s configurations. 

C. The Generalization of the LE Method using [20] 

In Section VI, the performance of the proposed framework
has been compared with the best reported method in [20] on
the 9MS dataset. Various binarization kernels have been considered
in the framework, such as the LE and PC binarization
methods described in Section II-A. In addition, a generalized
binarization kernel, which was obtained by merging the LE
method and the method of [20], has been referred to. Here,
we briefly describe this combined kernel.

Based on the observation of [20], the 7th and 8th bands
of the MS images have been considered as the source of
background information. Therefore, two variables are defined.
The first variable is the background image, $I_{\mathrm{BG}}$, obtained by
calculating the pixel-wise mean of the 7th and 8th bands.
The second variable is the gray image, $I$, defined in the
first two steps of the binarization wrapper, i.e., applying
blurring/deblurring and the to-gray transform on the 3-band
selection of the input MS image (as described in Section IV-A).
Then, $I_{\mathrm{BG}}$ is adjusted to have its histogram aligned with that
of $I$. Finally, $I$ is modified by removing $I_{\mathrm{BG}}$ in a weighted
approach to calculate the final intermediate gray image. The
rest of the processes are the same as shown in Figure 2.

D. Color to Gray: Multi-band to Single-band Conversion 

As discussed in Section IV-A, the proposed binarization wrapper requires a color-to-gray transform. Various transforms are briefly listed below:

1) Luminance: This transform attempts to encode the color information in the output gray-level image [9]:

$$I_{\mathrm{lum}}(x) = 1 - \big(0.27\,u_4(x) + 0.67\,u_3(x) + 0.06\,u_2(x)\big),$$

where $I_{\mathrm{lum}}$ is the output gray image calculated in the BW10 protocol. For a traditional color image, the $u_4$, $u_3$, and $u_2$ bands are equivalent to the $u_{\mathrm{Red}}$, $u_{\mathrm{Green}}$, and $u_{\mathrm{Blue}}$ bands.

2) Green: In this transform, only the ‘green’ band is used:

$$I_{\mathrm{gre}}(x) = 1 - u_3(x).$$

3) Average: The output is the average of all three visible bands:

$$I_{\mathrm{avg}}(x) = 1 - \frac{1}{3}\sum_{i=2}^{4} u_i(x).$$

4) Min-Average: A combination of the average of the bands 



5) Information Insensitive: This nonlinear conversion first rotates the color image in the RGB color space in such a way that the information difference between any two of its projections on the color axes is minimal [10]. Then, the projection that has the minimal intensity variation in the textual regions is selected as the output.
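The three linear transforms in the list (Luminance, Green, and Average) can be sketched as follows. The sketch assumes that the bands are float arrays scaled to [0, 1], that $u_2$, $u_3$, and $u_4$ are the blue, green, and red bands, and that the three luminance weights (0.27, 0.67, 0.06) are all added. The function name `to_gray` and its `mode` argument are hypothetical; the Min-Average and Information Insensitive transforms are not covered.

```python
import numpy as np

def to_gray(u2, u3, u4, mode="luminance"):
    """Convert three visible bands to one gray image (inverted output).

    u2, u3, u4: blue, green, and red bands as float arrays in [0, 1].
    The result is 1 minus the combined intensity, as in the listed formulas.
    """
    u2, u3, u4 = (np.asarray(u, dtype=float) for u in (u2, u3, u4))
    if mode == "luminance":
        # ASSUMPTION: all three weighted terms are summed.
        return 1.0 - (0.27 * u4 + 0.67 * u3 + 0.06 * u2)
    if mode == "green":
        return 1.0 - u3
    if mode == "average":
        return 1.0 - (u2 + u3 + u4) / 3.0
    raise ValueError(f"unknown mode: {mode}")
```

Because the luminance weights sum to one, a uniformly white pixel maps to 0 and a uniformly black pixel maps to 1 under all three modes.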

(a) z67, 2nd dataset [3] (the second band is shown).
(b) z95, 2nd dataset [3] (the second band is shown).
(i) Result of the CVS (p = 0.97, PC) on (a); FM = 0.55.
(k) The GT of (a) [3].
(l) The GT of (b) [3].

Fig. S-1: The subjective evaluation of some of the CVS combinations along with the method reported in [20] on two images from the second dataset of [3].

(a) MS image of Fig. 4(a1) in [S4].
(c) Ground truth of (a).
(d) Result of [S4] on (a); FM = 0.44. The image is in BW10.
(e) MS image of Fig. 4(a2) in [S4].
(f) Result of the CVS (p = 0.97, PC) on (e).
(h) Result of [S4] on (e). The image is in BW10.

Fig. S-2: The subjective evaluation of some of the CVS combinations along with the method reported in [S4] on the multispectral images of that paper.

Postscript 

A. Evaluation on the MS-TEx 2015 Dataset 

In this Postscript, the results of applying one instance of the proposed framework to the dataset of the MS-TEx 2015 contest [19] are provided. This dataset, referred to as the 10MS dataset in Supplementary Material S.A3, was not available at the time of submitting this paper. We compared CVS(p=0.5, PC) with the contest’s participants and with two state-of-the-art binarization methods: 1) the Laplacian-Energy method [12] and 2) the FAIR method [P1]. The full description of the contest and its results is provided in [19].

B. Participants in ICDAR 2015’s MS-TEx 2015 Contest 

The full description of the contest and the participants is available in [19]. Here, a brief description of the participating methods is recalled for the purpose of discussing the results. Also, for clarity in referring to the methods, a 4-letter code name is assigned to each method.

1) Computer Vision Lab. Vienna University of Technology (Markus Diem, Fabian Hollaus, and Robert Sablatnig) [DHSA]: This method incorporates three methods for multispectral text extraction: 1) thresholding a cleaned channel using the Lu et al. [P2] binarization, 2) training an Adaptive Coherence Estimator (ACE) proposed by Scharf and McWhorter [P3], and 3) combining the cleaned channel with the mean and standard deviation images and performing a GrabCut [P4]. In order to compute a cleaned channel, the background channel F8 (IR4 band) is removed from a visible channel F2 (Blue band) (see footnote 9).

2) Computer Vision Lab. Vienna University of Technology (Fabian Hollaus, Markus Diem, and Robert Sablatnig) [HDSA]: In the first step of this method, the binarization method of Lu et al. [P2] is applied to the Blue band (F2). The output of this step is used to estimate the mean spectral signature of the writing. This signature is used to train the Adaptive Coherence Estimator (ACE) suggested by Scharf and McWhorter [P3]. The resulting binary image is finally combined with the output of the binarization method of Lu et al. [P2] (see footnote 10).

3) Document Image and Pattern Analysis (DIPA) Center, Islamabad, Pakistan (Ahsen Raza) [RAZA]: This method is based on four main steps: 1) performing image fusion using a wavelet transform-based image fusion technique, 2) performing a conditional noise removal procedure using a mix of noise removal filters, 3) performing window-based thresholding (window size 5x5) using a modified form of Niblack’s thresholding technique [P5], and 4) performing conditional noise removal followed by image cleaning based on the aspect ratio of the connected components.

4) Institute of Automation, Chinese Academy of Sciences (Alex Zhang and 
Cheng-Lin Liu) [ZHLI]: The key of this method is to binarize images by a graph- 
based semi-supervised classification method: 1) extracting edges from the normal image 
(F2) using Canny edge detector, 2) coarse classification by rules, 3) fine classification 
by graph-based semi-supervised learning, and 4) removing the noise using F7 and F8 
multispectral bands. 

5) Information Sciences Institute, University of Southern California (Yue Wu, Stephen Rawls, Wael Abd-Almageed, and Premkumar Natarajan) [WRAN]: This method is composed of four major stages: 1) parameter estimation, 2) feature extraction, 3) classification, and 4) refinement. Various parameters, such as the text stroke width, the noise level, and the edge map, among others, are estimated. The method uses various statistics across all the spectrum images, between pairs of the spectrum images, and on single spectrum images. Moreover, all spectrum images are binarized via a supervised base model trained on the DIBCO datasets. Finally, the method applies a learned refinement classifier based on connected components analysis.


C. Evaluation Measures and Ranking 

In order to conform with the protocol of the MS-TEx 2015 Contest, in addition to the F-measure (FM) metric [P6], [P7], three other performance measures are considered in this section (see footnote 11): NRM (Negative Rate Metric) [P9], [P10], DRD (Distance Reciprocal Distortion) [P11], [P12], and Kappa ($\kappa$) [P13], [P14]. For completeness, all measures are defined here:

9 The source code is available at: https://github.com/diemmarkus/MSTEx-CVL.git.

10 The source code is available at: https://github.com/hollaus/MSTEx-CVL-matlab.

11 Some of these metrics are available from [P8].


1) F-measure (FM): The FM metric is the harmonic mean of the precision and recall metrics:

$$\mathrm{FM} = \frac{2RP}{R + P}, \qquad (5)$$

where $R = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$ and $P = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$ are the Recall and the Precision measures, and TP, FP, TN, and FN represent the True Positive, the False Positive, the True Negative, and the False Negative counts, respectively [P6], [P7]. For example, TP is the number of pixels that are ‘text’ on both the binary image being evaluated $B$ and the ground truth image $GT$.
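As a minimal sketch, the FM definition can be computed from two boolean masks in which `True` marks a ‘text’ pixel. The function names `confusion_counts` and `f_measure` are hypothetical, and returning 0 when there are no true positives is an assumption made here to avoid division by zero.

```python
import numpy as np

def confusion_counts(b, gt):
    """Return (TP, FP, FN, TN) for binary image b against ground truth gt."""
    b = np.asarray(b, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = int(np.sum(b & gt))    # text in both images
    fp = int(np.sum(b & ~gt))   # text only in the evaluated image
    fn = int(np.sum(~b & gt))   # text only in the ground truth
    tn = int(np.sum(~b & ~gt))  # background in both images
    return tp, fp, fn, tn

def f_measure(b, gt):
    """FM = 2RP / (R + P), with R the recall and P the precision."""
    tp, fp, fn, _ = confusion_counts(b, gt)
    if tp == 0:
        # ASSUMPTION: FM is taken as 0 when no text pixel is matched.
        return 0.0
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2.0 * recall * precision / (recall + precision)
```

The values in Table P-1 appear to be on a 0 to 100 scale, so the result would be multiplied by 100 for comparison.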

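2) NRM (Negative Rate Metric): The NRM calculates the amount of mismatch with respect to the ground truth:

$$\mathrm{NRM} = \frac{1}{2}\left(R_{\mathrm{FN}} + R_{\mathrm{FP}}\right), \qquad (6)$$

where $R_{\mathrm{FN}} = \mathrm{FN}/(\mathrm{TP} + \mathrm{FN})$ and $R_{\mathrm{FP}} = \mathrm{FP}/(\mathrm{FP} + \mathrm{TN})$ are the False Negative Rate and the False Positive Rate, respectively [P9].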
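A short sketch of the NRM computation, taken as half the sum of the false negative rate FN/(TP + FN) and the false positive rate FP/(FP + TN), is given below. Masks are boolean with `True` for ‘text’; the function name `nrm` is hypothetical, and treating a rate with an empty denominator as 0 is an assumption.

```python
import numpy as np

def nrm(b, gt):
    """NRM = (R_FN + R_FP) / 2 for binary image b and ground truth gt."""
    b = np.asarray(b, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = int(np.sum(b & gt))
    fp = int(np.sum(b & ~gt))
    fn = int(np.sum(~b & gt))
    tn = int(np.sum(~b & ~gt))
    # ASSUMPTION: a rate whose denominator is empty is taken as 0.
    r_fn = fn / (tp + fn) if (tp + fn) > 0 else 0.0
    r_fp = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    return 0.5 * (r_fn + r_fp)
```

Lower values indicate a better match; a perfect binarization gives 0.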


3) DRD (Distance Reciprocal Distortion): The DRD metric was proposed to measure the distortion between binary images [P11], [P12]. It accumulates the distortion associated with each of the $T$ mismatching pixels:

$$\mathrm{DRD} = \frac{\sum_{l=1}^{T} \mathrm{DRD}_l}{\mathrm{NUBN}}, \qquad (7)$$

where NUBN is the number of nonuniform (not all black or all white pixels) 8x8 blocks in the ground truth image. $\mathrm{DRD}_l$, which corresponds to the distortion of the $l$-th mismatching pixel $x_l$ [P12], is defined as the weighted sum of the pixels in the 5x5 block of the ground truth image that differ from the value of the mismatching pixel in the binary image $B$, i.e., $B(x_l)$. $\mathrm{DRD}_l$ can be expressed as follows:

$$\mathrm{DRD}_l = \sum_{i=-2}^{2} \sum_{j=-2}^{2} \left| GT\big(x_l + (i,j)\big) - B(x_l) \right| \times W(i,j), \qquad (8)$$

where $W$ is a normalized weight matrix [P12].
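The DRD computation can be sketched as below. Several details are assumptions, since the text only states that the weight matrix is normalized: the weights are taken as the reciprocal of the distance to the block center (zero at the center) and scaled to sum to one, only complete 8x8 blocks are counted for NUBN, and the ground truth is edge-padded so that 5x5 blocks near the border are defined. All function names are hypothetical.

```python
import numpy as np

def drd_weight_matrix(m=2):
    """ASSUMED weights: 1/distance to the center, 0 at the center, sum 1."""
    idx = np.arange(-m, m + 1)
    ii, jj = np.meshgrid(idx, idx, indexing="ij")
    dist = np.sqrt(ii.astype(float) ** 2 + jj.astype(float) ** 2)
    w = np.zeros_like(dist)
    w[dist > 0] = 1.0 / dist[dist > 0]
    return w / w.sum()

def nubn(gt, block=8):
    """Count nonuniform block x block tiles (ASSUMPTION: full tiles only)."""
    h, w = gt.shape
    count = 0
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            s = gt[r:r + block, c:c + block].sum()
            if 0 < s < block * block:
                count += 1
    return count

def drd(b, gt):
    """Sum of per-pixel distortions over mismatching pixels, over NUBN."""
    b = np.asarray(b).astype(float)
    gt = np.asarray(gt).astype(float)
    w = drd_weight_matrix(2)
    # ASSUMPTION: edge padding defines the 5x5 block at image borders.
    padded = np.pad(gt, 2, mode="edge")
    total = 0.0
    for r, c in zip(*np.nonzero(b != gt)):
        window = padded[r:r + 5, c:c + 5]  # 5x5 GT block centered at (r, c)
        total += float(np.sum(np.abs(window - b[r, c]) * w))
    n = nubn(gt)
    if n == 0:
        raise ValueError("ground truth has no nonuniform 8x8 block")
    return total / n
```

The explicit loop over mismatching pixels keeps the sketch close to the definition; a convolution-based version would be faster on large images.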


4) Kappa ($\kappa$): The Kappa ($\kappa$) coefficient [P13]-[P15], which is well known in the domain of remotely sensed hyperspectral image classification, estimates the inter-observer reliability (reproducibility). It provides a quantitative measure of the magnitude of agreement between observers. The calculation is based on the difference between the level of actual agreement (i.e., the “observed” agreement $P_o$) and the level of chance-only agreement (i.e., the “expected” agreement $P_e$):

$$\kappa = \frac{P_o - P_e}{1 - P_e} = \frac{N_o - N_e}{N - N_e}, \qquad (9)$$

where $N_o$, $N_e$, and $N$ are the number of matching pixels between $GT$ and $B$, the sum of the direct product of the vectors of the number of pixels in the black and white classes of $GT$ and $B$, and the total number of pixels, respectively [P15].
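A minimal sketch of the Kappa computation for two binary images follows, written in terms of the observed and expected agreement proportions. Here the expected agreement is taken as the sum over the two classes of the product of the class counts in GT and B, divided by N squared, which is the usual definition of Cohen's kappa and is an assumption about the normalization intended in the text. The function name `kappa` is hypothetical.

```python
import numpy as np

def kappa(b, gt):
    """Kappa = (P_o - P_e) / (1 - P_e) for two binary images."""
    b = np.asarray(b, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    n = gt.size
    # Observed agreement: fraction of pixels with the same label.
    p_o = int(np.sum(b == gt)) / n
    # Class counts (text, background) in each image.
    gt_text, b_text = int(gt.sum()), int(b.sum())
    gt_bg, b_bg = n - gt_text, n - b_text
    # ASSUMPTION: expected agreement normalized by N squared.
    p_e = (gt_text * b_text + gt_bg * b_bg) / (n * n)
    if p_e == 1.0:
        # Both images are uniform and identical: perfect agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

With this normalization, the count-based form $(N_o - N_e)/(N - N_e)$ gives the same value when $N_e$ is the class-count product sum divided by $N$.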


Ranking: The ranking method introduced in [P16] is used. In this ranking, for every image in the dataset, the best value of every metric among all the participating methods is first determined. The participating method with this best value receives a score of 1 for the corresponding metric, and the other methods are assigned a score of less than 1 depending on their performance with respect to the best value. Then, the scores of every participating method are summed to calculate its final score $S$:

$$S_k = \sum_{i=1}^{10} \sum_{j=1}^{4} \left\langle \frac{\mathrm{Best}_{i,j}}{\mathrm{value}_{k,i,j}},\ \frac{\mathrm{value}_{k,i,j}}{\mathrm{Best}_{i,j}} \right\rangle, \qquad k = 1, \ldots, 5, \qquad (10)$$

where $k$ denotes the index of a particular participant, and $\mathrm{value}_{k,i,j}$ is the value of metric number $j$ obtained on test image number $i$ by participant $k$. The operator $\langle \cdot, \cdot \rangle$ returns its first argument for those metrics $j$ that assign a lower value to a better performance (such as the DRD), and returns its second argument for those metrics $j$ that show the reverse behavior (for example, the FM). Finally, the method with the highest score $S$ is considered the best performing method, and so on.
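The scoring rule can be sketched as follows: for each image and metric, the best value over all methods is found, each method receives the ratio best/value for lower-is-better metrics or value/best for higher-is-better metrics, and the ratios are summed per method. The array layout (methods, images, metrics), the function name `ranking_scores`, and the requirement that all metric values be strictly positive are assumptions of this sketch.

```python
import numpy as np

def ranking_scores(values, lower_is_better):
    """Compute the score S for every method.

    values: array of shape (K methods, I images, J metrics), all > 0.
    lower_is_better: length-J booleans (True for NRM and DRD,
        False for FM and Kappa).
    """
    values = np.asarray(values, dtype=float)
    lower = np.asarray(lower_is_better, dtype=bool)
    # Best value per (image, metric) over all methods.
    best = np.where(lower, values.min(axis=0), values.max(axis=0))
    # Ratio is 1 for the best method and below 1 for the others.
    ratios = np.where(lower, best / values, values / best)
    # Sum over images and metrics to get one score per method.
    return ratios.sum(axis=(1, 2))

# Example with two methods, one image, and two metrics (FM, DRD).
example = np.array([[[80.0, 4.0]],
                    [[60.0, 2.0]]])
scores = ranking_scores(example, [False, True])
```

In the example, the first method has the best FM but twice the best DRD, and the second method has the best DRD but three quarters of the best FM.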


Rank   Method            FM_avg,1   NRM_avg,1   DRD_avg,1   κ_avg,1   S
1st    1 [DHSA]          84.87      8.704       3.560       83.79     35.12
2nd    2 [HDSA]          83.29      9.641       4.068       82.13     33.46
3rd    4 [ZHLI]          80.14      11.41       4.529       78.79     31.03
5th    5 [WRAN]          78.49      12.77       5.016       77.16     29.39
8th    3 [RAZA]          74.38      9.774       8.593       72.47     27.18
7th    Howe [12]         75.96      8.500       7.806       74.10     27.50
6th    Lelore [P1]       69.37      6.502       12.36       66.73     27.54
4th    CVS(p=0.5,PC)     73.64      6.872       9.556       71.55     29.54

TABLE P-1: The average performance, excluding the worst image, of the methods on the MS-TEx 2015 dataset, evaluated using four metrics. The rankings are determined using the S scores.


D. Results and Discussion on the MS-TEx 2015 Dataset

The performance of the methods is provided in Table P-1. Method 1 [DHSA] not only achieved the highest performance in terms of the ranking measure S (see footnote 12), but also provides the best performance in terms of every individual metric. The only exception is the NRM metric, for which FAIR and the proposed CVS method achieve a better performance. It could be argued that the a priori information that the F8 band provides background noise plays a significant role in the success of Method 1 [DHSA]. Similarly, Method 4 [ZHLI], which was the only method that successfully removed the ‘stamp’ annotation (as can be seen from Figure P-1), also uses the F7 and F8 bands to remove the noise. Another example from the dataset is provided in Figure P-2 (for the MS image z92).

As can be seen from the table, the CVS method outperforms all methods except FAIR in terms of NRM. To visualize this point, the image with the highest difference between the NRM scores of Method 1 [DHSA] and the CVS method is provided in Figure P-3. The latter performs better at preserving the connectivity of strokes. The generally low performance of the proposed framework could be attributed to the small size of the training dataset, especially in terms of examples of hidden text and annotations, which is amplified by the blind nature of the proposed framework, which treats all the bands in the same way. In the future, we will further investigate this framework by using bigger datasets and by integrating a priori information.

12 The S values reported in Table P-1 differ slightly from those reported in [19] because eight binarization methods are considered in the ranking pool here, in contrast to the seven methods considered in [19].

Fig. P-1: Successful removal of the ‘stamp’ mark from the image z31 by Method 4 [ZHLI].

Fig. P-2: (a) The Green band (F3) of the z92 image of the 10MS dataset. (b) Ground truth. (c) Method 1 [DHSA]. (d) The proposed framework with the PC binarization method [13] as its kernel and p = 0.50.

(c) Method 1 [DHSA] (NRM = 13.11). (d) CVS/PC/50 (NRM = 3.751).

Fig. P-3: The performance in terms of the NRM metric for the image z65 of the 10MS dataset.